Steve Elston
09/11/2023
Regression lines draw viewers' attention
| Property or Aesthetic | Perception | Data Types |
|---|---|---|
| Aspect ratio | Good | Numeric |
| Regression lines | Good | Numeric plus categorical |
| Marker position | Good | Numeric |
| Bar length | Good | Counts, numeric |
| Sequential color palette | Moderate | Numeric, ordered categorical |
| Marker size | Moderate | Numeric, ordered categorical |
| Line types | Limited | Categorical |
| Qualitative color palette | Limited | Categorical |
| Marker shape | Limited | Categorical |
| Area | Limited | Numeric or categorical |
| Angle | Limited | Numeric |
Problem: Modern data sets are growing in size and complexity
Goal: Understand key relationships in large complex data sets
Difficulty: Large data volume
Difficulty: Large numbers of variables
Note: we will address use of dimensionality reduction techniques in another lesson, time permitting
All scientific graphics are limited to a 2-dimensional projection
But, complex data sets have a great many dimensions
We need methods to project large complex data onto 2 dimensions
Generally, multiple views are required to understand complex data sets
Some chart types are inherently scalable.
Over-plotting occurs when markers lie on top of one another.
What can we do about over-plotting?
Down sample to 20%, alpha = 0.1, size = 2
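The down-sampling step above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function name `down_sample`, the seed, and the synthetic points are hypothetical and not taken from the lesson's data.

```python
import random

def down_sample(points, fraction=0.2, seed=42):
    """Return a random subset holding roughly `fraction` of the points."""
    rng = random.Random(seed)               # seeded so the subset is reproducible
    k = round(len(points) * fraction)       # number of points to keep
    return rng.sample(points, k)            # sampling without replacement

# Synthetic (x, y) pairs standing in for a large data set (assumption)
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10_000)]
subset = down_sample(points, fraction=0.2)
print(len(points), len(subset))

# If matplotlib is available, the subset could then be drawn with
# transparent, small markers, for example:
#   xs, ys = zip(*subset)
#   plt.scatter(xs, ys, alpha=0.1, s=2)
```

Sampling without replacement keeps each retained marker a real observation, so the overall shape of the point cloud is preserved while the marker count drops.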
Alternatives to avoid over-plotting for truly large data sets
Example: Density of sale price by time
Example: Contour plot of 2-D KDE of sale price vs. time
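A simpler relative of the 2-D KDE is a 2-D histogram: count the points that fall in each cell of a grid and display the counts instead of the markers. The sketch below shows only that counting step in plain Python; it is not the KDE itself, and the name `bin_2d`, the bin edges, and the synthetic data are assumptions made for illustration.

```python
import random
from collections import Counter

def bin_2d(xs, ys, x_edges, y_edges):
    """Count points per rectangular cell; points outside the edges are skipped."""
    def index(value, edges):
        # Return the bin i with edges[i] <= value < edges[i + 1], else None
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                return i
        return None

    counts = Counter()
    for x, y in zip(xs, ys):
        i, j = index(x, x_edges), index(y, y_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1
    return counts

rng = random.Random(1)
xs = [rng.uniform(2008, 2016) for _ in range(50_000)]   # synthetic "time"
ys = [rng.gauss(5.0, 0.5) for _ in range(50_000)]       # synthetic "log price"
x_edges = [2008 + i for i in range(9)]                  # 8 one-year bins
y_edges = [3.0 + 0.5 * j for j in range(9)]             # 8 half-unit bins
counts = bin_2d(xs, ys, x_edges, y_edges)
print(len(counts), sum(counts.values()))
```

The number of cells is fixed by the grid, so the size of the display does not grow with the number of observations.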
Example: Airline passenger counts by month and year displayed by sequential heat map
Heat map of
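A heat map of this kind needs the counts arranged as a grid with one row per month and one column per year. The sketch below shows that pivot in plain Python; the helper name `pivot_counts` and the numbers are hypothetical placeholders, not the airline data.

```python
def pivot_counts(records, row_keys, col_keys):
    """Arrange (row, col, value) records into a 2-D list, rows x columns."""
    lookup = {(r, c): v for r, c, v in records}
    # Missing combinations become None so they can be left blank
    return [[lookup.get((r, c)) for c in col_keys] for r in row_keys]

# Placeholder counts for illustration only (not real passenger data)
records = [
    ("Jan", 2001, 100), ("Feb", 2001, 110),
    ("Jan", 2002, 120), ("Feb", 2002, 130),
]
grid = pivot_counts(records, row_keys=["Jan", "Feb"], col_keys=[2001, 2002])
for month, row in zip(["Jan", "Feb"], grid):
    print(month, row)
```

Each cell of the grid is then colored with a sequential palette, so larger counts map to darker or more saturated colors.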
Sometimes a creative alternative is best
Often situation specific; many possibilities
Finding a good one can require significant creativity!
Example, choropleth for multi-dimensional geographic data
Example, time series of box plots
Example: Time ordered box plots of quarterly sales price
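A time series of box plots replaces every observation in a period with a handful of summary statistics, which is why it scales to large data. The following is a minimal sketch of that reduction, assuming whiskers at the minimum and maximum for simplicity; the function name and the toy observations are hypothetical.

```python
import statistics
from collections import defaultdict

def box_stats_by_period(observations):
    """Group (period, value) pairs and return box statistics in time order."""
    groups = defaultdict(list)
    for period, value in observations:
        groups[period].append(value)
    summary = {}
    for period in sorted(groups):            # sorted labels give the time order
        values = groups[period]
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[period] = {"min": min(values), "q1": q1, "median": median,
                           "q3": q3, "max": max(values)}
    return summary

# Toy observations (assumption): two quarters, five values each
observations = [
    ("2015Q1", 1.0), ("2015Q1", 2.0), ("2015Q1", 3.0), ("2015Q1", 4.0), ("2015Q1", 5.0),
    ("2015Q2", 2.0), ("2015Q2", 4.0), ("2015Q2", 6.0), ("2015Q2", 8.0), ("2015Q2", 10.0),
]
summary = box_stats_by_period(observations)
for period, stats in summary.items():
    print(period, stats)
```

However many observations a quarter contains, it contributes only five numbers to the display.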
How can we display multidimensional count (categorical) data at scale?
Mosaic plots display the relative proportions of counts in a contingency table
Plot area is divided and fully filled by a set of tiles
The larger the count, the larger the tile area
Example: How do counts change with working vs. non-working day, weather and year?
Mosaic plot displays tiles with size proportional to count
Counts conditioned on variables, working day, weather, year
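The tile geometry of a mosaic plot can be sketched for a two-way contingency table: column widths follow the marginal proportions of the first variable, and tile heights within a column follow the conditional proportions of the second. The function name `mosaic_tiles` and the counts below are hypothetical, and the sketch assumes every row has a non-zero total.

```python
def mosaic_tiles(table):
    """Compute unit-square tiles for a two-way contingency table.

    `table` maps first-variable level -> {second-variable level: count}.
    Returns {(level1, level2): (x, y, width, height)}.
    """
    total = sum(sum(row.values()) for row in table.values())
    tiles = {}
    x = 0.0
    for level1, row in table.items():
        row_total = sum(row.values())
        width = row_total / total            # column width = marginal proportion
        y = 0.0
        for level2, count in row.items():
            height = count / row_total       # tile height = conditional proportion
            tiles[(level1, level2)] = (x, y, width, height)
            y += height
        x += width
    return tiles

# Hypothetical counts, for illustration only
table = {
    "working": {"clear": 60, "rain": 20},
    "non-working": {"clear": 15, "rain": 5},
}
tiles = mosaic_tiles(table)
for key, (x, y, w, h) in tiles.items():
    print(key, round(w * h, 3))              # tile area = count / total
```

Because width times height equals count divided by total, the tile areas sum to one and the plot area is fully filled. A third conditioning variable would split each tile again in the same way.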
Facet plot of wind by month
How can we understand the relationships in complex high-dimensional data with many variables?
Must confront the limitations of the 2-dimensional projection inherent in computer graphics
Need methods to scale to higher dimensions
Apply a combination of arrays of plots and filtering
Aesthetics: used to increase the number of projected dimensions
Arrays of plots: subsets show relationships in a complex data set
Pairwise scatter plots: matrix of all pairwise combinations of variables
Faceting: uses values of categorical or numeric variables to plot subsets
Cognostics: sort a large number of variables to find important relationships
Display multiple plot views in an array or grid
Scatter plot matrix used to investigate relationships between a number of variables
A scatter plot matrix creates a two-dimensional array of plots of variable pairs
- Upper triangular plots: Scatter plots with regression lines
- Lower triangular plots: Hexbin plots of density
- Diagonal plots: Histograms of variables
## Imports used by this example; assumes the diabetes data frame is already loaded
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.PairGrid(diabetes.drop('Sex', axis=1), hue='sex_categorical', palette="Set2", height=1.5)
_=g.map_upper(sns.regplot, order=1, truncate=True, scatter_kws={'s':0.5})
_=g.map_lower(plt.hexbin, alpha=0.5, cmap='Blues', gridsize=15, linewidths=0)
_=g.map_diag(plt.hist, histtype="step", linewidth=1)
plt.show()
Scatterplot matrix with different plot types
Facet plots revolutionized statistical graphics starting about 30 years ago
Facet plots extend the number of dimensions projected onto the 2-D plot surface
Key idea: Create array of plots of subsets of the data
Like many good ideas, facet plotting was invented several times
Facet plot projected on a grid
Grid defined by variable values for row and column
Can tile or shingle numeric variables for grid
Can create variables by combinations of categories from other variables
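The difference between tiling and shingling a numeric variable can be sketched as follows: tiles are adjacent intervals, while shingles are widened so that neighboring intervals overlap. This is a simplified equal-width version, not an equal-count shingle algorithm; the function names and the synthetic wind values are assumptions.

```python
def make_intervals(lo, hi, n, overlap=0.0):
    """Split [lo, hi] into n equal-width intervals.

    overlap=0 gives tiles (adjacent bins); overlap>0 widens each interval by
    that fraction of the bin width on both sides, giving overlapping shingles.
    """
    lo, hi = float(lo), float(hi)
    width = (hi - lo) / n
    pad = overlap * width
    return [(max(lo, lo + i * width - pad), min(hi, lo + (i + 1) * width + pad))
            for i in range(n)]

def facet_subsets(rows, key, intervals):
    """Return one list of rows per interval (one list per facet panel).

    Intervals are closed, so a value exactly on a shared edge lands in both
    neighboring panels, and a row may appear in several shingles.
    """
    return [[r for r in rows if a <= r[key] <= b] for a, b in intervals]

# Synthetic rows (assumption): wind values 0, 3, ..., 30
rows = [{"wind": w, "count": 10 * w} for w in range(0, 31, 3)]
tiles = make_intervals(0, 30, 3)
shingles = make_intervals(0, 30, 3, overlap=0.25)
tile_subsets = facet_subsets(rows, "wind", tiles)
shingle_subsets = facet_subsets(rows, "wind", shingles)
print([len(s) for s in tile_subsets], [len(s) for s in shingle_subsets])
```

Each subset becomes one panel of the facet grid. Overlapping shingles let observations near a boundary contribute to both neighboring panels, which smooths the transition from panel to panel.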
Example: Histogram of wind speed conditioned on month
Facet plot of wind by month
Example: Plot count of riders by hour, conditioned on weather and season
Facet plot of count by season and weather
Example: Box plot counts of riders by hour, conditioned on weather and season
Facet plot of count by season and weather
How can we visualize very high dimensional data?
Modern data sets can have thousands to millions of variables
Idea: need to find the most important relationships
Use a cognostic to sort relationships
Idea originally proposed by Tukey, 1982, 1985
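## Imports used by this example; assumes the housing data frame is already loaded
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
## Add an intercept column to the data frame
housing.loc[:,'intercept'] = [1.0] * housing.shape[0]
## Demean the decimal time column
mean_time = housing.loc[:,'time_decimal'].mean()
housing.loc[:,'time_demean'] = housing.loc[:,'time_decimal'].subtract(mean_time)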
## Find the slope coefficients for each state
def prepare_temp(df, group_value, group_variable='state'):
    ## Select one group and standardize its log price (zero mean, unit standard deviation)
    temp = df.loc[df.loc[:,group_variable]==group_value,:].copy()
    mean_price = np.mean(temp.loc[:,'log_medSoldPriceSqft'])
    temp.loc[:,'log_medSoldPriceSqft'] = np.subtract(temp.loc[:,'log_medSoldPriceSqft'], mean_price)
    std_price = np.std(temp.loc[:,'log_medSoldPriceSqft'])
    temp.loc[:,'log_medSoldPriceSqft'] = np.divide(temp.loc[:,'log_medSoldPriceSqft'], std_price)
    return temp, mean_price, std_price

def compute_slopes(df, column, group_variable='state'):
    ## Fit an OLS trend per group; the slope is the value used to sort the groups
    ## Note: groups are taken from `column`; `group_variable` is not used in the body
    slopes = []
    entities = []
    intercepts = []
    for e in df.loc[:,column].unique():
        temp, mean_price, std_price = prepare_temp(df, e, group_variable=column)
        temp_OLS = sm.OLS(temp.loc[:,'log_medSoldPriceSqft'],temp.loc[:,['intercept','time_demean']]).fit()
        slopes.append(temp_OLS.params.time_demean)
        intercepts.append(temp_OLS.params.intercept)
        entities.append(e)
    slopes_df = pd.DataFrame({'slopes':slopes, 'intercept_coef':intercepts, 'entity_name':entities})
    ## Largest slopes first, with a fresh 0-based index
    slopes_df.sort_values(by='slopes', ascending=False, inplace=True)
    slopes_df.reset_index(inplace=True, drop=True)
    return slopes_df

state_slopes = compute_slopes(housing, 'state')
## Plot states with the fastest growing pricing
def find_changes(df, slopes, start, end, col='state'):
    ## .loc slicing is label based, so both start and end rows are included
    increase = slopes.loc[start:end,'entity_name']
    increase_df = df.loc[df.loc[:,col].isin(increase),:]
    increase_df = increase_df.merge(slopes, how='left', right_on='entity_name', left_on=col)
    return increase_df, increase

big_increase_states, increase_states = find_changes(housing, state_slopes, 0, 7)
## Display scatterplot vs time
def plot_price_by_entity(df, order, entity='state', xlims=(2007.5, 2016.5)):
    ## One panel per entity, in the order given, each with a regression line
    g = sns.FacetGrid(df, col=entity, col_wrap=4, height=2, col_order=order)
    g = g.map(sns.regplot, 'time_decimal', 'log_medSoldPriceSqft',
              line_kws={'color':'red'}, scatter_kws={'alpha': 0.1, 's':0.5})
    g.set(xlim=(xlims[0],xlims[1]))
    plt.show()

_=plot_price_by_entity(big_increase_states, increase_states)
States with greatest increase in housing price
We have explored these key points
Proper use of plot aesthetics enables projection of multiple dimensions of complex data onto the 2-dimensional plot surface.
All plot aesthetics have limitations, which must be understood to use them effectively.
The effectiveness of a plot aesthetic varies with the data type and the application.
Visualization of modern data sets, growing in size and complexity
Visualization limited by 2-dimensional projection
Goal: Understand key relationships in large complex data sets
Difficulty: Large data volume
Difficulty: Large numbers of variables